LSQ: Learned Step Size Quantization
FIGURE 2.1
Computation of a low-precision convolution or fully connected layer, as envisioned here. This technique feeds low-precision inputs, represented by w̄ and x̄, into the matrix multiplication units of convolutional or fully connected layers in deep learning networks. The low-precision integer matrix multiplication can be computed efficiently, and a step size then scales the output with a relatively low-cost, high-precision scalar-tensor multiplication. This scaling step can potentially be combined with other operations, such as batch normalization, through algebraic merging. The approach minimizes the memory and computational costs associated with the matrix multiplication.
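To make the data flow of Fig. 2.1 concrete, the following is a minimal PyTorch sketch of such a layer under the assumptions just described; the names quantized_linear, x_bar, w_bar, s_x, and s_w are illustrative and not taken from the original.

```python
import torch

def quantized_linear(x_bar, w_bar, s_x, s_w):
    """Illustrative sketch of the low-precision layer in Fig. 2.1 (not reference code).

    x_bar and w_bar hold the low-precision integer codes of the activations and weights
    (stored here in float tensors whose entries are integers; dedicated hardware would
    run this on integer multiply-accumulate units). s_x and s_w are the step sizes.
    """
    acc = x_bar @ w_bar.t()      # the expensive part: integer-valued matrix multiplication
    return acc * (s_x * s_w)     # cheap high-precision scalar-tensor rescaling of the output
```

The single scalar s_x * s_w is also the natural place where a subsequent scale, such as one from batch normalization, could be folded in algebraically, as the caption notes.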
2.2.2 Step Size Gradient
LSQ provides a way to learn s based on the training loss by introducing the following gradient with respect to the quantizer's step size parameter:
\[
\frac{\partial \hat{v}}{\partial s} =
\begin{cases}
-v/s + \lfloor v/s \rceil, & \text{if } -Q_N < v/s < Q_P, \\
-Q_N, & \text{if } v/s \le -Q_N, \\
Q_P, & \text{if } v/s \ge Q_P.
\end{cases}
\tag{2.10}
\]
This gradient is computed using the straight-through estimator, as proposed by [9], to approximate the gradient through the round function as a pass-through operation, while the round itself is left in place for the purpose of differentiating downstream operations; all other operations are differentiated in the usual way.
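As an illustration, below is a minimal PyTorch sketch of such a quantizer; the names round_ste and lsq_quantize are assumptions for this example, and any additional gradient scaling is omitted.

```python
import torch

def round_ste(x):
    # Straight-through estimator: round in the forward pass, but let gradients
    # pass through the rounding as if it were the identity.
    return (x.round() - x).detach() + x

def lsq_quantize(v, s, q_n, q_p):
    # Quantize v with learned step size s, clipping the scaled value to [-q_n, q_p].
    # Because the round stays in the forward graph while its gradient is treated as
    # a pass-through, autograd recovers the step-size gradient of Eq. (2.10).
    v_scaled = torch.clamp(v / s, -q_n, q_p)
    v_bar = round_ste(v_scaled)   # low-precision integer code
    v_hat = v_bar * s             # quantized value mapped back to the original scale
    return v_hat
```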
The gradient used by LSQ differs from related approximations (Fig. 2.2) in that it neither transforms the data before quantization nor estimates the gradient by algebraically canceling terms after removing the round operation from the forward equation, an approach that yields ∂v̂/∂s = 0 whenever −Q_N < v/s < Q_P [43]. In those earlier methods, the proximity of v to a transition point between quantized states has no effect on the gradient of the quantization parameters. Intuitively, however, the closer a value v lies to a quantization transition point, the more likely a small change in s is to move it into a different quantization bin, producing a large jump in v̂. Thus ∂v̂/∂s should grow as the distance from v to a transition point shrinks, as observed in the LSQ gradient.
Notably, this gradient emerges naturally from the simple quantizer formulation and the use
of the straight-through estimator for the round function.
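A quick autograd check of this behavior, using the lsq_quantize sketch above with illustrative values (3-bit signed data, so Q_N = 4 and Q_P = 3), shows the magnitude of the step-size gradient growing as v approaches the transition point at v/s = 0.5:

```python
import torch

s = torch.tensor(1.0, requires_grad=True)
for v_val in (0.10, 0.40, 0.49):   # values approaching the 0.5 transition point
    v_hat = lsq_quantize(torch.tensor(v_val), s, q_n=4, q_p=3)
    (grad,) = torch.autograd.grad(v_hat, s)
    print(f"v = {v_val:.2f}  d v_hat / d s = {grad.item():+.2f}")
# Prints -0.10, -0.40, -0.49: equal to -v/s + round(v/s), larger in magnitude near the transition.
```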
In LSQ, each layer of weights and each layer of activations has its own step size, represented as a 32-bit floating-point value. These step sizes are initialized to 2⟨|v|⟩/√Q_P, where ⟨|v|⟩ is the mean absolute value of v, computed from the initial weight values or from the first batch of activations, respectively.
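As a concrete illustration of this initialization (the function name and tensor shapes below are assumptions, not part of the original), a per-layer weight step size might be set up as follows:

```python
import torch

def init_step_size(v, q_p):
    # Initialize s to 2 * mean(|v|) / sqrt(Q_P), using the initial weights or the
    # first batch of activations as v.
    return 2 * v.abs().mean() / (q_p ** 0.5)

# Example: one learnable 32-bit step size per layer of weights (values are illustrative).
w = torch.randn(64, 3, 3, 3)                           # initial weights of a conv layer
s_w = torch.nn.Parameter(init_step_size(w, q_p=127))   # e.g. 8-bit signed weights, Q_P = 2**7 - 1
```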